Abstract

This paper discusses a new family of bounds for use in similarity search, related 
to those used in metric indexing, but based on Ptolemy's inequality, rather than 
the metric axioms. Ptolemy's inequality holds for the well-known Euclidean dis- 
tance, but is also shown here to hold for quadratic form metrics in general. In 
addition, the square root of any metric is Ptolemaic, which means that the prin- 
ciples introduced in this paper have a very wide applicability. The inequality 
is examined empirically on both synthetic and real-world data sets and is also 
found to hold approximately, with a very low degree of error, for important 
distances such as the angular pseudometric and several L_p norms. Indexing
experiments are performed on several data sets, demonstrating a highly increased
filtering power when using certain forms of Ptolemaic filtering, compared to ex- 
isting, triangular methods. It is also shown that combining the Ptolemaic and 
triangular filtering can lead to better results than using either approach on its 
own. 

Key words: Algorithms, Data Structures, Information Retrieval, Metric 
Indexing, Similarity Retrieval, Ptolemy's Inequality 



1. Introduction 

In similarity search, data objects are retrieved based on their similarity to a 
query object; as for other modes of information retrieval, the related indexing 
methods seek to improve the efficiency of the search. Two approaches seem to
dominate the field: Spatial access methods [1-3], based on coordinate geome- 
try, and metric access methods [4-7], based on the metric axioms. Similarity 
retrieval with spatial access methods is often restricted to L_p norms, or other
norms with predictable behavior in R^k, while metric access methods are designed
to work with a broader class of distances, basically any distance that
satisfies the triangular inequality. This gives the metric approach a wider field
of application, by foregoing some assumptions about the data, in some cases 
resulting in lower performance. Interestingly, even in cases where spatial ac- 
cess methods are applicable, metric indexing may be superior in dealing with 
high-dimensional data, because of its ability to bypass the so-called represen- 
tational dimensionality and deal with the intrinsic dimensionality of the data 
directly [see, e.g., 8].

It seems clear that there are advantages both to making strong assumptions
about the data, as in the spatial approach, and in working directly with the dis- 
tances, as in the metric approach. The direction taken in this paper is to apply 
a new set of restrictions to the distance, separate from the metric axioms, still 
without making any kind of coordinate-based assumptions. The method that 
is introduced is shown to hold a potential for highly increased filtering power, 
and while it holds for the square root of any metric, it is also shown analytically
to apply to quadratic form metrics in general (with Euclidean distance as
an important special case), and empirically, as an approximation with a very 
low degree of error, to important cases such as the angular pseudometric, edit 
distance between strings, and several L_p norms.

2. Basic Concepts 

The similarity in similarity retrieval is usually formalized, inversely, using 
a dissimilarity function, a nonnegative real-valued function d(·,·) over some
universe of objects, U. For typographic convenience, I will usually abbreviate
d(x,y) to xy, for all x, y ∈ U. A distance query consists of a query object q,
and some form of distance threshold, either given as a range (radius) r, or as
a neighbor count k. For the range query, all objects o for which qo ≤ r are
returned; for the k-nearest-neighbor query (kNN), the k nearest neighbors of q
are returned, breaking ties arbitrarily. The kNN queries can be implemented in
terms of range queries in a quite general manner [5].

For a function to qualify as a dissimilarity function (or premetric) the value 
of xx must be zero. It is generally assumed (mainly for convenience) that 
the dissimilarity is symmetric (xy = yx, giving us a distance) and isolating
(xy = 0 ⇔ x = y, yielding a semimetric). From a metric indexing perspective,
it is most crucial that the distance be subadditive, or triangular (obeying the
triangular inequality, xz ≤ xy + yz). A metric is any symmetric, isolating,
triangular dissimilarity function. Metrics are related to the concept of norms:
A norm metric is a metric of the form xy = ‖x − y‖, where x and y are vectors
and ‖·‖ is a norm.

Many distances satisfy the metric properties (including several important 
distances over sets, strings and vectors), and metric indexing is the main ap- 
proach in distance based (as opposed to coordinate based) retrieval. The metric- 
ity (triangularity, first and foremost) is exploited to construct lower bounds for 
efficient filtering and partitioning. As discussed in an earlier tutorial [7], there 
are several ways of using the triangular inequality in metric indexing, usually 
by pre-computing the distances between certain sample objects (called pivots 
or centers, depending on their use) and the other objects in the data set. By 
computing the distances between the query and the same sample objects (or 
some of them), triangularity can be used to filter out objects that clearly do not
satisfy the criteria of the query. 

Even though most distance based indexing has focused on the metric axioms, 
this paper focuses on another property, known as Ptolemy's inequality, which 
entails that 

$$ xv \cdot yu \leq xy \cdot uv + xu \cdot yv, \tag{1} $$

for all objects x, y, u, v. Premetrics satisfying this inequality are called
Ptolemaic.¹ In terms of the Euclidean plane: For any quadrilateral, the sum of
the pairwise products of opposing sides is greater than or equal to the product
of the diagonals (see Fig. 1).

Figure 1: An illustration of (a) the triangular inequality, xz ≤ xy + yz, and (b) Ptolemy's inequality, xv · yu ≤ xy · uv + xu · yv

It is a well-known fact that Euclidean distance is Ptolemaic [10]. What is 
perhaps less well known is that there is a natural connection between Ptolemaic 
metrics on vector spaces and a generalization of Euclidean distance — a family of 
distances collectively referred to as quadratic form distance. A quadratic form 
distance may be expressed like this: 

$$ d(x, y) = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} (x_i - y_i)(x_j - y_j)}, $$

or, in matrix notation, √(z'Az), where x and y are vectors, and z = x − y. The
weight matrix A = [a_ij] is a measure of "unrelatedness" between the dimensions,
which uniquely defines the distance. We can, without loss of generality,
assume that A is symmetric, as any antisymmetries will have no bearing on
the distance [11]. In order for the distance to be a metric, A must also be
positive-definite.²
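
As a minimal sketch of the definition above (not code from the paper), the double sum can be evaluated directly from a weight matrix. The function name `quadratic_form_distance` and the three example matrices are illustrative assumptions; the matrices are chosen to be symmetric and positive-definite, which the function itself does not verify.

```python
import math

def quadratic_form_distance(x, y, A):
    """Return sqrt(z'Az) with z = x - y, for a square weight matrix A given as a list of rows."""
    z = [xi - yi for xi, yi in zip(x, y)]
    total = sum(A[i][j] * z[i] * z[j]
                for i in range(len(z)) for j in range(len(z)))
    return math.sqrt(total)

# Hypothetical weight matrices, all symmetric positive-definite.
identity = [[1.0, 0.0], [0.0, 1.0]]    # ordinary Euclidean distance
weighted = [[4.0, 0.0], [0.0, 1.0]]    # diagonal A: a weighted Euclidean distance
correlated = [[2.0, 1.0], [1.0, 2.0]]  # off-diagonal terms couple the two dimensions

x, y = (0.0, 0.0), (3.0, 4.0)
d_euclid = quadratic_form_distance(x, y, identity)
d_weighted = quadratic_form_distance(x, y, weighted)
d_corr = quadratic_form_distance(x, y, correlated)
```

With the identity matrix the result coincides with the usual Euclidean distance between the two points; the other two matrices rescale or couple the coordinate differences before the square root is taken.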

Quadratic form distances take into account possible correlations between the 
dimensions of the vector space, which make them especially suited for comparing 
histograms [see, e.g., 12]. For example, when comparing color histograms, it 
might be natural to compare the bin (that is, dimension) for orange to those of 
red and yellow [11]. If A is restricted to a diagonal matrix, a weighted Euclidean 
distance results, with the identity matrix leading to ordinary Euclidean distance 
(see Fig. 2). The fact that this important family of distances is usable with the
techniques presented in this paper is expressed in the following theorem. 

Theorem 1. A distance function is a quadratic form metric if and only if it is 
a Ptolemaic norm metric. 



1 Note that Ptolemy's inequality neither implies nor is implied by triangularity [9]. 

2 Some sources require only positive-semidefiniteness [e.g., 6], but this would result in a 
pseudometric, allowing a distance of zero between different objects. It is possible to relax the
requirement on A by adding requirements to the inputs [11]. 



Figure 2: Metric balls for three related distance types in R^2: (a) Euclidean, (b) weighted Euclidean, and (c) quadratic form



Proof. Let the quadratic form metric d(x, y) be √(z'Az), where z = x − y.
Because A is symmetric positive-definite, z'Az defines an inner product, making
any such d a norm metric based on an inner product norm (a norm of the form
‖z‖ = √⟨z, z⟩, where ⟨·,·⟩ is an inner product). Note that the converse also
holds: Any inner product norm can be expressed as a quadratic form metric 
(with a positive-definite A). It is known that a norm metric is Ptolemaic if 
and only if its norm is an inner product norm [10, 13], which gives us the 
theorem. □ 

Beyond this, there are certainly many other Ptolemaic metrics (with the 
discrete metric, d(x,y) = 1 ⇔ x ≠ y, as an obvious example). In fact, for
any metric d, the metric √d is Ptolemaic [14]. In terms of distance orderings,
and therefore similarity queries, this new metric is equivalent to the original, 
meaning that the techniques in this paper are applicable to all metrics. However, 
the transform will increase its intrinsic dimensionality [15], making the process 
of indexing harder. As shown in Sect. 6.2, several non-Ptolemaic metrics seem 
to be "sufficiently Ptolemaic" to be used with the Ptolemaic indexing techniques 
without such a transform. 
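
The effect of the square-root transform can be illustrated on a single quadruple. The sketch below is an illustration rather than the paper's code: the helper names and the choice of the unit square under Manhattan distance are assumptions, and a small tolerance `tol` is added because the transformed distances are floating-point values that meet the inequality with equality.

```python
import math
from itertools import permutations

def is_ptolemaic_quadruple(dist, points, tol=1e-12):
    """Check inequality (1) for every labelling x, y, u, v of four points."""
    for x, y, u, v in permutations(points, 4):
        lhs = dist(x, v) * dist(y, u)
        rhs = dist(x, y) * dist(u, v) + dist(x, u) * dist(y, v)
        if lhs > rhs + tol:
            return False
    return True

def manhattan(a, b):
    """L_1 distance between two coordinate tuples."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def sqrt_manhattan(a, b):
    """Square root of the L_1 distance; same distance ordering as manhattan."""
    return math.sqrt(manhattan(a, b))

# Corners of the unit square: both diagonals have L_1 length 2, all sides length 1,
# so the product of the diagonals (4) exceeds the sum of the side products (2).
square = [(0, 0), (1, 0), (1, 1), (0, 1)]

violates_before = not is_ptolemaic_quadruple(manhattan, square)
holds_after = is_ptolemaic_quadruple(sqrt_manhattan, square)
```

Because the square root is monotone, nearest-neighbor rankings are unchanged by the transform; only the distance values, and with them the usable search radii, change.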

3. Related Work 

As discussed at length elsewhere [3-7], there are many published methods 
that deal with indexing distances based on the metric properties. While there 
has been an increasing focus on reducing I/O and general CPU time, the primary
aim of most publications has been minimizing the number of distance
computations, based on the assumption that the distance is highly expensive
to compute.³ While focusing exclusively on this one performance criterion may
not be altogether realistic (yielding, for example, methods with linear query
time [17, 18] or quadratic memory use [19, 20]), it has proved a useful foundation
on which methods with more nuanced performance properties could be
built [e.g., 21-24]. This paper also focuses on minimizing distance computations,
setting aside related questions of algorithm engineering for later.

3 This assumption may very well stem from the seminal work of Feustel and Shapiro [16],
where calculation of the metric involved comparing every permutation of the node sets of
two graphs.

The main mechanism through which distance calculations may be avoided 
is through various forms of filtering or exclusion, using lower bounds [7].⁴ If the
structure of the data objects is known, very precise, yet cheap, lower bounds 
can be constructed [25], but for metric indexing, only the general properties 
of the distance may be used. As described in Sect. 2, this is done by storing 
precomputed distances, or ranges of distances, involving the data set and certain 
sample objects (centers or pivots).⁵

Rather than listing existing indexing methods, only the most relevant of the 
basic principles will be addressed here. Two rather general theorems contain 
the majority of the indexing principles as special cases. The first of these, 
dealing with metric balls, is given here without proof. The second, dealing with 
generalized hyperplanes, as well as more details and proofs for both theorems 
can be found in the aforementioned tutorial, the survey by Hjaltason and Samet 
[5] or the textbook by Zezula et al. [6]. In the following, let o, p and q be objects 
in a universe U, and let the implicit distance d be a metric over U. 

Theorem 2. The value of qo may be bounded as 

$$ \max\{ \lfloor op \rfloor - \lceil qp \rceil,\ \lfloor qp \rfloor - \lceil op \rceil \} \leq qo \leq \lceil qp \rceil + \lceil op \rceil, $$

where ⌊uv⌋ ≤ uv ≤ ⌈uv⌉, for any objects u, v in U. □

The expressions ⌊uv⌋ and ⌈uv⌉ refer to known lower and upper bounds for
uv, as, in some cases, these distances may not be fully known. For example, p
may be the center of a metric ball containing o, with covering radius r. In that
case ⌊po⌋ = 0 and ⌈po⌉ = r. If the query-pivot distance is known, we get the
lower bound

$$ qp - r \leq qo, \tag{2} $$

which is exactly the bound used to check for overlap between a query ball and 
a bounding ball in a metric tree, for example. 
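
A small sketch of how Theorem 2 and its ball special case (2) might be used for pruning follows. The function names and the (lower, upper) argument layout are assumptions made for illustration; the lower bound is additionally clipped at zero, since distances are nonnegative.

```python
def triangular_bounds(qp_lo, qp_hi, op_lo, op_hi):
    """Theorem 2: bounds on qo from lower/upper limits on qp and op."""
    lower = max(op_lo - qp_hi, qp_lo - op_hi, 0.0)
    upper = qp_hi + op_hi
    return lower, upper

def ball_may_contain_match(query_pivot_dist, covering_radius, query_radius):
    """Bound (2): a ball around p can be skipped when qp - r exceeds the query radius."""
    lower, _ = triangular_bounds(query_pivot_dist, query_pivot_dist,
                                 0.0, covering_radius)
    return lower <= query_radius

# Exactly known qp = 5, object somewhere in a ball of covering radius 2 around p.
bounds = triangular_bounds(5.0, 5.0, 0.0, 2.0)
must_visit = ball_may_contain_match(5.0, 2.0, 3.0)   # lower bound equals the radius
can_prune = not ball_may_contain_match(5.0, 2.0, 2.5)
```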

4. Ptolemaic Pivot Filtering 

In a manner similar to Theorem 2, Ptolemy's inequality may be used to 
construct lower bounds for filtering. In the following, the technique known as 
pivoting is used. The derivation of a more general bound is deferred to Sect. 5. 

Triangular pivoting is based on the following lower bound, a special case of 
the one in Theorem 2, where the distances are known exactly: 

$$ qo \geq |qp - op| \tag{3} $$

Here, q is the query object, o is a candidate result object, while p is a so-called 
pivot object, whose function is to help construct the bound. 



4 The converse, inclusion using upper bounds, is also possible, but less frequently useful, 
simply because most of the data should normally be excluded from the result set. 

5 In the case of generalized hyperplane indexing, the information stored is simply which 
center is closest to a given object.

This bound follows from basic restructuring of the triangular inequality:

$$ op + qo \geq qp $$
$$ qo \geq qp - op, \tag{4} $$

and, in the same manner,

$$ qo \geq op - qp. \tag{5} $$

Together, (4) and (5) lead directly to (3). The bound is normally strengthened
by using a set of several pivots, P:

$$ qo \geq \max_{p \in P} |qp - op| $$
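
The multi-pivot triangular bound is cheap to evaluate once the pivot distances are available. The following sketch assumes Euclidean points and uses Python's `math.dist`; the function name `triangular_pivot_bound` and the example coordinates are illustrative choices, not taken from the paper.

```python
import math

def triangular_pivot_bound(query_to_pivots, object_to_pivots):
    """Lower bound on qo: the maximum of |qp - op| over all pivots p."""
    return max(abs(qp - op)
               for qp, op in zip(query_to_pivots, object_to_pivots))

# Hypothetical example in the Euclidean plane.
pivots = [(0.0, 0.0), (4.0, 0.0)]
query, candidate = (0.0, 3.0), (4.0, 3.0)

query_to_pivots = [math.dist(query, p) for p in pivots]       # computed at query time
object_to_pivots = [math.dist(candidate, p) for p in pivots]  # precomputed in an index

bound = triangular_pivot_bound(query_to_pivots, object_to_pivots)
actual = math.dist(query, candidate)
```

If `bound` exceeds the search radius, the candidate can be discarded without computing `actual`.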

A similar derivation can be made for Ptolemaic metrics. In the following, q, 
p, and o retain their previous meaning, but we also add another pivot object, s: 

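$$ qs \cdot op + qo \cdot ps \geq qp \cdot os $$
$$ qo \cdot ps \geq qp \cdot os - qs \cdot op $$
$$ qo \geq (qp \cdot os - qs \cdot op) / ps \tag{6} $$

Here we can maximize over all pairs of pivots:

$$ qo \geq \max_{p, s \in P} \frac{qp \cdot os - qs \cdot op}{ps} \tag{7} $$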

By exchanging p and s in (6), we could exchange the terms in the numerator, 
allowing us to use the absolute value in (7). This would not strengthen the bound
as it stands, but would allow us to halve the number of pivot pairs examined.

For the normal pivoting bound to be useful, the pivot should be closer to either
the query or to the candidate object; the difference in the two distances is what 
gives the bound its filtering power. For the Ptolemaic pivoting bound, it seems 
that one way of getting good results would be to have one pivot close to the 
query, while the other is close to the candidate object, giving a high value for 
the numerator in (7). However, this intuition is tempered by the denominator, 
which dictates that the pivots should also be close to each other. Invariably, 
the tradeoff here will need to be based on empirical considerations. 

As long as the distance matrix between the pivots is precomputed, the bound 
can be computed for every pair of pivots, and the maximum used as the final 
bound. As will be shown in Sect. 6, this Ptolemaic pivoting bound is a significant 
improvement over the classical triangular one. 
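
The Ptolemaic pivoting bound (7) can be sketched in the same style as the triangular one. The code below is an illustration under stated assumptions: Euclidean points, distinct pivots (so that no pivot-pivot distance is zero), the absolute-value form over unordered pivot pairs, and hypothetical names such as `ptolemaic_pivot_bound`. The example uses a query at (-1, 0), pivots at (0, 0) and (1, 0), and an arbitrarily chosen candidate at (0, 2).

```python
import math
from itertools import combinations

def ptolemaic_pivot_bound(query_to_pivots, object_to_pivots, pivot_matrix):
    """Lower bound on qo: max over pivot pairs (p, s) of |qp*os - qs*op| / ps."""
    best = 0.0
    for i, j in combinations(range(len(query_to_pivots)), 2):
        numerator = abs(query_to_pivots[i] * object_to_pivots[j]
                        - query_to_pivots[j] * object_to_pivots[i])
        best = max(best, numerator / pivot_matrix[i][j])
    return best

pivots = [(0.0, 0.0), (1.0, 0.0)]
query, candidate = (-1.0, 0.0), (0.0, 2.0)

pivot_matrix = [[math.dist(a, b) for b in pivots] for a in pivots]  # precomputed once
query_to_pivots = [math.dist(query, p) for p in pivots]
object_to_pivots = [math.dist(candidate, p) for p in pivots]

ptolemaic = ptolemaic_pivot_bound(query_to_pivots, object_to_pivots, pivot_matrix)
triangular = max(abs(qp - op)
                 for qp, op in zip(query_to_pivots, object_to_pivots))
actual = math.dist(query, candidate)
```

For this particular configuration the Ptolemaic bound is tighter than the triangular one while still not exceeding the true distance; other configurations can favor the triangular bound, which is one reason for combining the two.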

The difference between triangular and Ptolemaic pivoting in the Euclidean 
plane is illustrated in Fig. 3, where the ratio between bound and distance from 
a query at (−1,0), with the pivots placed at (0,0) and (1,0), is plotted for each
point, where zero (a non-informative lower bound) is black and one (a perfect 
bound) is white. 

While the bound in (7) is the one examined in depth in this paper, the
ideas of Ptolemaic indexing have wider implications. In the following section, a 
more general theorem (Theorem 3) is given, which is a Ptolemaic analogue of 
Theorem 2.

Figure 3: Accuracy, measured as the ratio between the pivoting bound and the actual distance,
for (a) triangular pivoting and (b) Ptolemaic pivoting. The three points represent, from left to right, a query and two pivots



5. Generalizing the Ptolemaic Bound 

The following theorem generalizes the bound (7), as a Ptolemaic analogue of
Theorem 2. This generalization is included as a starting point for new indexing 
methods, and is not examined empirically in this paper. 

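Theorem 3. Let o, p, q and s be objects in a universe U, and let d be a
Ptolemaic distance over U. The value of qo may then be bounded as

$$ \frac{1}{\lceil ps \rceil} \cdot \max\{ \lfloor qp \rfloor \cdot \lfloor os \rfloor - \lceil qs \rceil \cdot \lceil op \rceil,\ \lfloor qs \rfloor \cdot \lfloor op \rfloor - \lceil qp \rceil \cdot \lceil os \rceil \} \leq qo \leq \frac{1}{\lfloor ps \rfloor} \cdot ( \lceil qp \rceil \cdot \lceil os \rceil + \lceil qs \rceil \cdot \lceil op \rceil ), $$

where ⌊uv⌋ ≤ uv ≤ ⌈uv⌉, for any objects u, v in U.⁶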

Proof. The two cases of the lower bound correspond to the two possible orderings
of the products in the numerator of the lower bound (7), both of which are
permissible. The only change here is that instead of the exact values, we use 
upper and lower limits. The lower limits occur before the subtractions, while 
the upper limits occur after, as well as in the denominator. These substitutions 
can only lower the value of the bound, and hence the inequality still holds. The 
upper bound follows from the Ptolemaic inequality 

ps ■ qo < qp ■ os + qs ■ op . 

Dividing by ps and safely substituting upper limits in the numerator and a lower 
limit in the denominator, we arrive at the upper bound. □ 
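
The substitutions described in the proof can be written out as a short function. This is a sketch following the proof text (lower limits before the subtraction, upper limits after it and in the denominator of the lower bound; upper limits in the numerator and a lower limit in the denominator of the upper bound). The name `ptolemaic_interval_bounds`, the (lower, upper) pair convention, and the clipping of the lower bound at zero are assumptions; the lower limit on ps is assumed to be positive.

```python
import math

def ptolemaic_interval_bounds(qp, qs, op, os_, ps):
    """Bounds on qo; each argument is a (lower, upper) pair of limits on that distance."""
    (qp_lo, qp_hi), (qs_lo, qs_hi) = qp, qs
    (op_lo, op_hi), (os_lo, os_hi) = op, os_
    ps_lo, ps_hi = ps
    numerator = max(qp_lo * os_lo - qs_hi * op_hi,
                    qs_lo * op_lo - qp_hi * os_hi,
                    0.0)                      # distances are nonnegative
    lower = numerator / ps_hi
    upper = (qp_hi * os_hi + qs_hi * op_hi) / ps_lo
    return lower, upper

# Euclidean example: q = (-1, 0), p = (0, 0), s = (1, 0), o = (0, 2).
root5 = math.sqrt(5.0)
actual = root5                                # true distance qo

# All distances known exactly.
exact = ptolemaic_interval_bounds((1, 1), (2, 2), (2, 2), (root5, root5), (1, 1))

# Only ranges known for op and os, as with covering shells.
loose = ptolemaic_interval_bounds((1, 1), (2, 2), (1.5, 2.5), (2.0, 2.5), (1, 1))
```

Widening the intervals for op and os can only weaken both bounds, which matches the argument in the proof.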

The applications of this theorem to pivot filtering have already been discussed
in Sect. 4. However, its metric analog, Theorem 2, is also used for
overlap checking with balls and shells, which is what the upper and lower limits
to the distances represent. The notion of overlapping metric balls is inherently
triangular, and does not directly translate to Ptolemaic distances. We are still
able to exploit similar information, but we are in the somewhat unusual situation
of working with two balls (or shells) at once. Take the upper bound, in the
case where we know qp and qs (and, of course, ps). In order to apply this bound
(for automatic inclusion of o), we would have to know that o falls inside two
balls, one around p and one around s, with radii ⌈op⌉ and ⌈os⌉, respectively.

6 Note that ⌊·⌋ and ⌈·⌉ are not used to represent floor and ceiling here.

While this sort of "double containment" is not the norm in current metric 
indexing methods, it is certainly not impossible to implement. One would simply 
need to let each region be represented by two distance balls, and maintain two 
covering radii — one for each center. There is an inherent tradeoff here: If the 
objects are far apart, the covering radii will necessarily become quite large; 
however, if we move them closer, the upper bound will increase. 

In considering the lower bound, we see something interesting: If we envision 
a structure with covering radii around both p and s, the lower limits ⌊op⌋ and
⌊os⌋ both become zero, leaving us also with a total lower bound of zero. We
see that, as for the pivot filtering case, we may need for one of the pivots to 
be more "query-like," and the pivots need to be different from each other. For 
example, if the query is close to p but far from s, and the converse holds for 
the object, the lower bound will be high. However, this may not be enough. 
Consider the case where the object falls within a ball with radius r around p
(giving us ⌈op⌉ = r and ⌊os⌋ = ps − r); we then get the following variation of the lower bound
from (2):

$$ qo \geq \frac{1}{ps} \cdot \bigl( qp \cdot (ps - r) - r \cdot qs \bigr) = qp - r \cdot \frac{qp + qs}{ps} $$

The only difference from the triangular condition is the scaling factor (qp + 
qs)/ps, which we can see as regulating the influence of the radius. If the query 
lies directly between p and s (that is, ps = qp + qs) this new bound is, in fact, 
exactly equivalent to the triangular one. However, for all other cases, the new 
bound is worse. 
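
A quick numerical check of the scaling factor (qp + qs)/ps may make the comparison concrete. The two helper functions and the distances used are illustrative assumptions; the second configuration is a 3-4-5 right triangle formed by q, p and s.

```python
def triangular_ball_bound(qp, r):
    """Bound (2): qo >= qp - r."""
    return qp - r

def ptolemaic_ball_bound(qp, qs, ps, r):
    """Single-ball Ptolemaic variant: qo >= qp - r * (qp + qs) / ps."""
    return qp - r * (qp + qs) / ps

r = 1.0
triangular = triangular_ball_bound(3.0, r)

# Query directly between the pivots: ps = qp + qs, so the scaling factor is 1.
collinear = ptolemaic_ball_bound(3.0, 2.0, 5.0, r)

# Query off the line through p and s: ps < qp + qs, so the radius is inflated.
off_line = ptolemaic_ball_bound(3.0, 4.0, 5.0, r)
```

Since ps can never exceed qp + qs in a metric space, the factor is at least one, and the single-ball Ptolemaic variant never beats the triangular bound.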

What's missing is the "other half," as it were: an inverted ball around s, 
excluding o, giving us a proper ⌊os⌋, in addition to the covering radius, ⌈op⌉.
This would be available in a situation where we have covering shells from each
pivot to its sibling regions (as in GNAT [26] and its descendants), or where we
use "inside/outside" partitioning with multiple pivots simultaneously (as in
D-Index [27]). In such cases, where both upper and lower bounds are available
for both op and os, the bound in Theorem 3 can be used directly, substituting 
exact values for ps, qs and qp. 

6. Experimental Results 

Two sets of experiments have been performed to evaluate the usefulness of 
Ptolemy's inequality in distance indexing: The first set of experiments evaluates
its filtering power (using pivot filtering), while the second evaluates its recall 
rates for non-Ptolemaic metrics, for uses in approximate search. 

6.1. Filtering Power for Ptolemaic Metrics 

The first set of experiments was designed to compare the filtering power
of the Ptolemaic and triangular approaches. Fig. 4 shows the results for pivot
filtering, with both the triangular and Ptolemaic lower bounds.

The first four data sets used were five- and ten-dimensional vectors. Both 
in five and ten dimensions, one uniformly random and one clustered data set 
were used. In the clustered case, vectors were drawn in equal proportion from 
ten Gaussian clusters (with σ² = 0.1), centered around vectors drawn uniformly
at random from the unit hypercube. This clustering approach is similar to that
used by Zezula et al. [21]. In addition, a set of 64-dimensional image histograms 
under a quadratic form distance was used, as well as another set of uniformly 
random ten-dimensional vectors, under the angular pseudometric (the angle 
formed by the vectors and the origin, also known as normalized correlation), 
which is equivalent, in terms of retrieval results, to the well-known cosine dis- 
tance. The histogram distance used was similar to that described by Hafner 
et al. [11], using Euclidean distance in RGB space as the basis for the weight 
matrix.⁷ The histograms themselves were constructed by posterizing random
images from an online image repository [28]. In all cases, the number of objects
was 10,000. The pivot filtering was performed directly on a precomputed
distance table, somewhat like in LAESA [17]. In addition to the full Ptolemaic
pivot filtering described in Sect. 4, a version was used that included each pivot
only once, resulting in the bound being computed on n − 1 pairs (equivalent to
2n − 2 bounds of the form of (7), because the absolute value was used), which
gives a CPU use closer to that of traditional pivoting. Of course, there is a range 
of settings available here, from a linear number of bounds, to the quadratic 
number used by the full filter. 

The pivots used were selected using the method of Pedreira and Brisaboa 
[18], which has the advantage of adapting the number of pivots automatically 
to the complexity of the data set. Note that this method has been designed to 
select pivots suitable for metric indexing, and may therefore be less suited to 
Ptolemaic methods. 

The plots in Fig. 4 show the total number of distance computations needed
(that is, the number of pivots plus the number of candidate objects that must 
be explicitly examined using the actual distance function). The search radius 
(the horizontal axis) is described using the number of objects it encompasses 
(10-50). In each case, m is the number of pivots used. The experiments were 
averaged over ten randomly generated data sets,⁸ each tested with 100 queries
randomly drawn from the data set.⁹
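
The cost measure used in these plots (pivots plus unfiltered candidates) can be reproduced in miniature. The sketch below is not the experimental code from the paper: it assumes Euclidean distance, pivots sampled uniformly at random rather than by the method of [18], a fixed numeric radius, a small floating-point slack in the filter, and hypothetical names throughout.

```python
import math
import random
from itertools import combinations

def triangular_bound(q_piv, o_piv, pivot_table=None):
    """Triangular pivoting bound: max over pivots of |qp - op|."""
    return max(abs(qp - op) for qp, op in zip(q_piv, o_piv))

def ptolemaic_bound(q_piv, o_piv, pivot_table):
    """Ptolemaic pivoting bound (7), absolute value over unordered pivot pairs."""
    return max(abs(q_piv[i] * o_piv[j] - q_piv[j] * o_piv[i]) / pivot_table[i][j]
               for i, j in combinations(range(len(q_piv)), 2))

def range_search(data, object_table, pivots, pivot_table, query, radius, bound):
    """Return (indices of matches, number of distance computations at query time)."""
    q_piv = [math.dist(query, p) for p in pivots]
    computations = len(pivots)                 # query-pivot distances
    matches = []
    for index, o_piv in enumerate(object_table):
        # Slack of 1e-9 guards against rounding error in the computed bound.
        if bound(q_piv, o_piv, pivot_table) > radius + 1e-9:
            continue                           # filtered out, no distance computed
        computations += 1                      # candidate is checked explicitly
        if math.dist(query, data[index]) <= radius:
            matches.append(index)
    return matches, computations

rng = random.Random(0)
data = [tuple(rng.random() for _ in range(5)) for _ in range(500)]
pivots = rng.sample(data, 5)
pivot_table = [[math.dist(a, b) for b in pivots] for a in pivots]    # built once
object_table = [[math.dist(o, p) for p in pivots] for o in data]     # built once
query = tuple(rng.random() for _ in range(5))
radius = 0.3

tri_matches, tri_cost = range_search(data, object_table, pivots, pivot_table,
                                     query, radius, triangular_bound)
pto_matches, pto_cost = range_search(data, object_table, pivots, pivot_table,
                                     query, radius, ptolemaic_bound)
```

Both filters return the same result set as a linear scan; only the reported cost differs. Comparing `tri_cost` and `pto_cost` over many queries would give a rough, small-scale analogue of the curves described here.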

As can be seen, for these cases, the full Ptolemaic approach clearly outperforms
the triangular, especially for higher dimensionalities.¹⁰ The partial
Ptolemaic filtering also offers significant advantages in several cases. 

Note that the last experiment is approximate: The angular pseudometric is 
not Ptolemaic. Even so, the number of false negatives (indicated by the numbers 
next to each point on the graph) is fairly low, especially for the partial Ptolemaic 
filter. The suitability of such approximations for various other distances is 
explored further in Sect. 6.2. It is worth noting that the Ptolemaic filters used in 
Fig. 4 are pure, that is, not combined with a triangular bound. Fig. 5 illustrates 



7 The matrix used was a_ij = 1 − d_ij/d_max, where i and j are histogram bins.
8 For the color histograms, the data sets were identical. 

9 The pivot objects were specifically excluded from being used as queries, as that would 
automatically have given all bounds optimal performance. 

10 Note that uniform vectors have a higher intrinsic dimensionality than clustered vectors [4].

Figure 4: Distance computations for radii covering 10-50 neighbors (avg. over ten runs w/100
random queries; n = 10,000). Vectors were drawn randomly (with ten random Gaussian clusters,
σ = 0.1, and uniformly) in five and ten dimensions. Except for the last two experiments
(4(c) and 4(f)), which used a quadratic form distance and the angular pseudometric, the
retrieval was performed in Euclidean space

Figure 5: Relative filtering power of the triangular and Ptolemaic pivoting bounds for fixed-radius
10-NN queries on 10,000 uniformly random ten-dimensional vectors (100 queries, averaged
over ten runs). The region marked "both" represents objects filtered out by both
heuristics. All the gray regions represent true negatives, while the white region represents 
false positives, where actual distance computations are needed 



the relative contribution of the triangular and Ptolemaic pivoting bounds, if 
both are used in range search where the radius covers the ten nearest objects, 
on 100,000 uniformly random ten-dimensional vectors (averaged over ten data 
sets, each tested with 100 queries drawn randomly from the data set). The 
pivots were chosen uniformly at random from the data set. In this figure, only 
negatives are shown — that is, objects that need to be filtered out before the ten 
nearest neighbors are left. The four regions represent (1) objects filtered out
only by the Ptolemaic bound, (2) objects that both bounds are able to filter 
out, (3) objects filtered out only by the triangular bound, and (4) objects that 
neither bound manages to disqualify. 

It is clear that even for a relatively modest number of pivots, the filtering 
power of the Ptolemaic bound is high, and the difference in the exclusive con- 
tributions of the two bounds is surprisingly large. At ten pivots, the Ptolemaic 
bound filters out 93.0 

A progression similar to that in Fig. 5 was also found for clustered data, 
with a somewhat less pronounced difference (data not shown).

6.2. Approximation Rates for Non-Ptolemaic Metrics

Although the proposed approach is strictly correct only for Ptolemaic dis- 
tances, it might also work well as an approximation for other distances; Fig. 4(f) 
already gives some indication of this. While any metric can be made Ptolemaic, 
as mentioned in Sect. 2, the resulting metric may be harder to index, and ap- 
proximate indexing of the original may still be useful. Table 1 shows results for 
several data sets and distances. The three types of objects used were vectors, 
sets and strings. The vectors were generated as described in Sect. 6.1. The 
sets were generated randomly for a given maximum cardinality by including or 
excluding each object with equal probability. The string data sets were a list of 
234,936 words from Webster's Second International Edition, as distributed with 
the Macintosh OS version 10.5, and the lines of A Tale of Two Cities by Charles
Dickens, with short lines (fewer than six characters) stripped away. The latter
data set is similar to one used by Brin [26].

Table 1: Results from approximation experiments. The values given are the average Ptolemaic
proportions of the quadruples, as well as the standard deviations (σ). Subscripts on numbers
represent powers of ten. R^k refers to vectors of dimension k, while 2^{Z_k} are sets of cardinality
k

Distance                  Clustered, R^5      Clustered, R^10     Uniform, R^10
                          avg.   σ            avg.   σ            avg.   σ
L_1 Manhattan distance    0.98   1.70_{-3}    1.00   3.81_{-4}    1.00   8.63_{-4}
L_2 Euclidean distance*   1.00                1.00                1.00
L_3                       0.99   6.39_{-4}    1.00   1.11_{-4}    1.00   4.58_{-5}
L_5                       0.98   1.41_{-3}    1.00   6.03_{-4}    1.00   2.14_{-4}
L_10                      0.97   1.39_{-3}    0.99   1.31_{-3}    1.00   4.94_{-4}
L_100                     0.96   2.01_{-3}    0.98   1.49_{-3}    1.00   5.27_{-4}
L_∞ Chebyshev distance    0.96   2.09_{-3}    0.98   1.45_{-3}    1.00   5.48_{-4}
θ Angular distance        0.99   5.78_{-4}    1.00   1.19_{-4}    1.00   4.00_{-5}

                          Uniform, 2^{Z_10}   Uniform, 2^{Z_20}
                          avg.   σ            avg.   σ
Hamming distance          0.93   2.69_{-3}    0.98   9.57_{-4}
Jaccard distance          0.99   8.90_{-4}    1.00   1.60_{-4}

                          Webster             Dickens
                          avg.   σ            avg.   σ
Levenshtein distance      1.00   7.81_{-5}    1.00   0.00

* Euclidean distance is Ptolemaic, but is included for completeness

For the vector data sets, various L_p norms were used, as well as the angular
pseudometric. For the sets, the related Hamming and Jaccard distances were 
used (the cardinality of the symmetric difference, and proportion of the sym- 
metric difference to the union, respectively), while for the strings, Levenshtein 
distance (edit distance) was used. 

Each experiment was averaged over ten runs. Except for the string data 
sets, which were static, new data sets were randomly generated for each run. 
A run consisted of 10,000 randomly sampled quadruples (that is, sets of four 
distinct objects), and the proportion of the quadruples that satisfied Ptolemy's 
inequality was computed. 
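
The sampling procedure just described can be sketched as follows. This is an illustrative reconstruction, not the original experiment: it assumes that a quadruple counts as satisfying Ptolemy's inequality only if all three pairings of its six distances do, uses a smaller data set and sample count than the paper, adds a small floating-point tolerance, and introduces hypothetical names such as `ptolemaic_proportion`.

```python
import math
import random

def ptolemaic_proportion(objects, dist, samples, rng, tol=1e-12):
    """Estimate the fraction of random quadruples that satisfy Ptolemy's inequality."""
    satisfied = 0
    for _ in range(samples):
        x, y, u, v = rng.sample(objects, 4)   # four distinct objects
        products = (dist(x, y) * dist(u, v),
                    dist(x, u) * dist(y, v),
                    dist(x, v) * dist(y, u))
        # All three forms of inequality (1) hold exactly when the largest
        # product does not exceed the sum of the other two.
        if 2 * max(products) <= sum(products) + tol:
            satisfied += 1
    return satisfied / samples

def chebyshev(a, b):
    """L_infinity distance between two coordinate tuples."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))

rng = random.Random(0)
vectors = [tuple(rng.random() for _ in range(10)) for _ in range(200)]

euclidean_rate = ptolemaic_proportion(vectors, math.dist, 2000, rng)
chebyshev_rate = ptolemaic_proportion(vectors, chebyshev, 2000, rng)
```

For Euclidean distance the estimate is one by construction, which serves as a sanity check; the value for a non-Ptolemaic distance such as Chebyshev distance is an empirical estimate that depends on the data and the seed.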

As can be seen from Table 1, the much-used L_p norms vary in their approximation
rates from 99

The set distances also seem to conform to Ptolemy's inequality to a high 
degree for the 20-dimensional case, and the edit distance has a very low rate of 
inequality violations (with the Dickens data set outperforming the Webster data 
set by several orders of magnitude). For the L_p spaces and the angular pseudometric,
the intrinsic dimensionality was generally close to the actual number of
dimensions (slightly higher for uniformly random vectors, and varying with p).
The dimensionalities for the Hamming spaces were equal to about half the set 
cardinalities, while those for Jaccard distance were about 50

7. Discussion and Future Work

In summary, this paper has presented a new distance indexing principle, 
based on Ptolemy's inequality, which seems to result in greatly increased filter- 
ing power, in the cases where it is applicable. For vector spaces (in particular, 
R^k), it is directly applicable to quadratic form distances, including Euclidean
distance. It is also applicable to the square root of any metric, and approximately
applicable, with a very high degree of accuracy, to other distances such as
the angular pseudometric and edit distance over strings. 

While the experiments performed in this paper have shown that the Ptole- 
maic approach has potential for improved filtering, and therefore performance, 
over the purely triangular approach, the surface has barely been scratched when 
it comes to describing the properties of the new bounds. In the following, two 
main avenues for future research are outlined. 

• The performance studies in this paper have the traditional pure filtering 
focus, only counting distance computations. It would be useful to examine 
the performance of Ptolemaic indexing in a more realistic setting, where 
queries are timed on real-world data. As the full Ptolemaic pivoting is
computationally more expensive than the triangular version (a quadratic 
versus linear cost in the number of pivots), there would be a tradeoff 
in terms of distance computation cost, pivoting cost and filtering power. 
The number of bounds actually computed, in the spectrum from linear 
to quadratic, would be an important optimization parameter here. It 
might even be feasible to heuristically look for high bounds to compute, 
by examining the distances involved. For real-time testing like this, it 
would also be relevant to compare the distance-based algorithms to spatial 
access methods [1, 2], where possible. 

• It would be interesting to examine the basic properties of the Ptolemaic 
pivots, as related to the space (including which composition of pivots
would give the best pivoting results). It seems that higher dimensionalities
give Ptolemaic indexing a greater edge over the triangular one. In the 
experiments performed here, this may be due to the increased number of 
pivots, but even the partial Ptolemaic filtering increases its lead with
higher dimension and a higher pivot number, which cannot be explained
by the quadratic growth in the number of bounds. This is certainly an 
issue worthy of further examination. It also seems like high-dimensional 
spaces are more likely to be approximately Ptolemaic — perhaps because 
the distances are more similar. Based on this reasoning, high-dimensional 
spaces would also be more likely to be approximately triangular. Taking 
the square root, which will increase the intrinsic dimension, will make any 
metric exactly Ptolemaic. On the other hand, both the Ptolemaic and 
the triangular bound will be weaker in these cases, so there is clearly a 
tradeoff between accuracy and filtering power. These are issues that could 
be examined both empirically and analytically.